Variational joint self-attention for image captioning

Authors

Abstract

The image captioning task has attracted great attention from many researchers, and significant progress has been made in the past few years. Existing models, which mainly apply an attention-based encoder-decoder architecture, have achieved notable developments in captioning. These models, however, are limited in caption generation due to potential errors resulting from inaccurate detection of objects or incorrect relationships between objects. To alleviate this limitation, a Variational Joint Self-Attention model (VJSA) is proposed to learn the latent semantic alignment between a given image and its label description for guiding better caption generation. Unlike existing models, VJSA first uses a self-attention module to encode effective relationship information covering both intra-sequence and inter-sequence relationships. Then variational neural inference learns the distribution of the latent semantic alignment between the image and its corresponding description. In decoding, the learned latent alignment guides the decoder to generate a higher-quality caption. The results of the experiments reveal that VJSA outperforms the compared models, and its performances on various metrics show that it is feasible for better caption generation.
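To make the described pipeline concrete, the following is a minimal PyTorch sketch of the general pattern the abstract outlines: joint self-attention over region and word features, variational inference of a latent alignment vector via the reparameterization trick, and a decoding step conditioned on that latent vector. All module names, dimensions, and the single-step GRU decoder are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VariationalJointSelfAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, vocab_size=10000):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.to_mu = nn.Linear(d_model, d_model)      # mean of q(z | image, caption)
        self.to_logvar = nn.Linear(d_model, d_model)  # log-variance of q(z | image, caption)
        self.decoder_cell = nn.GRUCell(d_model * 2, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, regions, words):
        # Joint self-attention over the concatenated region/word sequence captures
        # both intra-sequence and inter-sequence relationships in one pass.
        joint = torch.cat([regions, words], dim=1)
        attended, _ = self.self_attn(joint, joint, joint)
        pooled = attended.mean(dim=1)

        # Variational inference: sample the latent alignment z with reparameterization.
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()

        # One decoding step guided by z (a full model would unroll over time steps).
        h = self.decoder_cell(torch.cat([pooled, z], dim=-1))
        logits = self.out(h)

        # KL divergence regularizes q(z | image, caption) toward a standard normal prior.
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
        return logits, kl

model = VariationalJointSelfAttention()
regions = torch.randn(2, 36, 512)  # e.g. 36 detected region features per image
words = torch.randn(2, 20, 512)    # embedded caption tokens
logits, kl = model(regions, words)
```

During training, the KL term would typically be added to the caption cross-entropy loss, following the standard variational-inference recipe.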


Similar Articles

Joint Learning of CNN and LSTM for Image Captioning

In this paper, we describe the details of our methods for the participation in the subtask of the ImageCLEF 2016 Scalable Image Annotation task: Natural Language Caption Generation. The model we used is the combination of a procedure of encoding and a procedure of decoding, which includes a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM)-based Recurrent Neural Network. We f...
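As a reference point, here is a compact sketch of the encode-then-decode pairing this abstract describes, assuming a torchvision ResNet-18 backbone and teacher-forced decoding; the backbone choice and all layer sizes are illustrative, not the submission's actual configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNLSTMCaptioner(nn.Module):
    def __init__(self, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()        # keep the 512-d pooled image feature
        self.cnn = backbone
        self.img_proj = nn.Linear(512, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Encode the image, then feed it as the first "token" before the words.
        feat = self.img_proj(self.cnn(images)).unsqueeze(1)
        seq = torch.cat([feat, self.embed(captions)], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)            # per-step vocabulary logits

model = CNNLSTMCaptioner()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 15)))
```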


Contrastive Learning for Image Captioning

Image captioning, a popular topic in computer vision, has achieved substantial progress in recent years. However, the distinctiveness of natural descriptions is often overlooked in previous work. It is closely related to the quality of captions, as distinctive captions are more likely to describe images with their unique aspects. In this work, we propose a new learning method, Contrastive Learn...
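The abstract is cut off before the objective; as a generic stand-in for the idea, the sketch below contrasts the likelihood a captioner assigns to a caption paired with its own image against the likelihood for a mismatched image. The hinge form and margin are assumptions for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def contrastive_caption_loss(log_p_matched, log_p_mismatched, margin=1.0):
    """Hinge-style contrast between log p(caption | its image) and
    log p(caption | a different image), averaged over the batch."""
    return F.relu(margin - (log_p_matched - log_p_mismatched)).mean()

# Usage with per-caption log-likelihoods from any captioning model:
log_p_pos = torch.tensor([-12.3, -10.8])  # caption scored on its own image
log_p_neg = torch.tensor([-11.9, -14.2])  # same caption on a shuffled image
loss = contrastive_caption_loss(log_p_pos, log_p_neg)
```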


Stack-Captioning: Coarse-to-Fine Learning for Image Captioning

The existing image captioning approaches typically train a one-stage sentence decoder, which makes it difficult to generate rich fine-grained descriptions. On the other hand, a multi-stage image caption model is hard to train due to the vanishing gradient problem. In this paper, we propose a coarse-to-fine multi-stage prediction framework for image captioning, composed of multiple decoders each of which...
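A minimal sketch of the coarse-to-fine idea follows: each decoder stage refines the previous stage's hidden sequence and emits its own logits, so a caption loss can be attached to every stage and gradients reach the early decoders directly. The GRU stages and sizes are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class StackedDecoder(nn.Module):
    def __init__(self, d=512, vocab_size=10000, stages=2):
        super().__init__()
        self.stages = nn.ModuleList(
            [nn.GRU(d, d, batch_first=True) for _ in range(stages)]
        )
        self.heads = nn.ModuleList(
            [nn.Linear(d, vocab_size) for _ in range(stages)]
        )

    def forward(self, x):
        # Each stage refines the previous stage's hidden sequence and emits its
        # own logits; training would sum a caption loss over all stages.
        all_logits = []
        for gru, head in zip(self.stages, self.heads):
            x, _ = gru(x)
            all_logits.append(head(x))
        return all_logits

decoder = StackedDecoder()
coarse_logits, fine_logits = decoder(torch.randn(2, 15, 512))
```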


Phrase-based Image Captioning

Generating a novel textual description of an image is an interesting problem that connects computer vision and natural language processing. In this paper, we present a simple model that is able to generate descriptive sentences given a sample image. This model has a strong focus on the syntax of the descriptions. We train a purely bilinear model that learns a metric between an image representat...
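The phrase-scoring core of such a model can be sketched as a single bilinear form, score(image, phrase) = image^T W phrase; the feature dimensions and the way phrases are embedded are assumptions for illustration, not the paper's setup.

```python
import torch
import torch.nn as nn

class BilinearPhraseScorer(nn.Module):
    def __init__(self, img_dim=2048, phrase_dim=300):
        super().__init__()
        # W is the learned bilinear metric between image and phrase spaces.
        self.W = nn.Parameter(torch.randn(img_dim, phrase_dim) * 0.01)

    def forward(self, image_feat, phrase_embs):
        # image_feat: (batch, img_dim); phrase_embs: (num_phrases, phrase_dim)
        # Returns a relevance score for every (image, phrase) pair.
        return image_feat @ self.W @ phrase_embs.T

scorer = BilinearPhraseScorer()
scores = scorer(torch.randn(2, 2048), torch.randn(50, 300))  # shape (2, 50)
```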


Domain-Specific Image Captioning

We present a data-driven framework for image caption generation which incorporates visual and textual features with varying degrees of spatial structure. We propose the task of domain-specific image captioning, where many relevant visual details cannot be captured by off-the-shelf general-domain entity detectors. We extract previously-written descriptions from a database and adapt them to new q...
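The retrieval step such a framework relies on can be sketched as a nearest-neighbor lookup over stored image features; the feature extractor and database layout here are assumptions for illustration, not the paper's system.

```python
import torch
import torch.nn.functional as F

def retrieve_description(query_feat, db_feats, db_descriptions):
    """Return the previously written description of the most visually
    similar database image, by cosine similarity of image features."""
    sims = F.cosine_similarity(query_feat.unsqueeze(0), db_feats, dim=1)
    return db_descriptions[int(sims.argmax())]

db_feats = torch.randn(100, 512)               # features of archived images
db_descriptions = [f"description {i}" for i in range(100)]
caption = retrieve_description(torch.randn(512), db_feats, db_descriptions)
```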



Journal

Journal title: IET Image Processing

Year: 2022

ISSN: 1751-9659, 1751-9667

DOI: https://doi.org/10.1049/ipr2.12470